fix: handle tool execution timeout/error causing IllegalStateExceptio…#956
Conversation
|
凡勇 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
f3080ad to
86c49aa
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
0a3e447 to
0684fd6
Compare
agentscope-ai#951) ReActAgent throws IllegalStateException when tool calls timeout or fail, because no tool result is written to memory, leaving orphaned pending tool call states that crash the agent on subsequent requests. Root cause: - Tool execution timeout/error propagates without writing results to memory - Pending tool call state remains, blocking subsequent doCall() invocations - validateAndAddToolResults() throws when user message has no tool results Changes: - doCall(): detect pending tool calls without user-provided results and auto-generate error results to clear the pending state - executeToolCalls(): add onErrorResume to catch tool execution failures and generate error tool results instead of propagating exceptions - Add generateAndAddErrorToolResults() helper to create error results for orphaned pending tool calls This ensures the agent recovers gracefully from tool failures instead of crashing, and the model receives proper error feedback to continue processing. Closes agentscope-ai#951
0684fd6 to
a6fd753
Compare
LearningGp
left a comment
There was a problem hiding this comment.
Handling tool exceptions as ToolResult seems like a solid approach. For pending tool calls where no result is provided, I’m wondering if it might be more appropriate to expose those to the developer for handling instead. Also, perhaps we could consider adding a configurable exception handler mechanism in the future? (Just a thought—this last point definitely doesn't need to block the PR).
agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java
Outdated
Show resolved
Hide resolved
agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Fixes ReActAgent resiliency when tool execution fails (timeout/error) by ensuring pending tool-call state is cleared via synthetic error tool results, preventing IllegalStateException on subsequent calls.
Changes:
- Update
ReActAgent#doCall()to detect pending tool calls without user-provided tool results and auto-generate error tool results to clear pending state. - Update
ReActAgent#executeToolCalls()to convert tool-execution failures into error tool results viaonErrorResumeinstead of propagating exceptions. - Update
HookStopAgentTestexpectations to validate the new auto-recovery behavior (no longer expecting an exception).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java | Adds auto-recovery for orphaned pending tool calls and converts tool execution failures into error tool results. |
| agentscope-core/src/test/java/io/agentscope/core/hook/HookStopAgentTest.java | Updates tests to expect auto-recovery rather than IllegalStateException when pending tool calls exist. |
agentscope-core/src/test/java/io/agentscope/core/hook/HookStopAgentTest.java
Show resolved
Hide resolved
agentscope-core/src/test/java/io/agentscope/core/hook/HookStopAgentTest.java
Show resolved
Hide resolved
agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java
Outdated
Show resolved
Hide resolved
agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java
Outdated
Show resolved
Hide resolved
agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java
Outdated
Show resolved
Hide resolved
- Extract shared buildErrorToolResult() helper to deduplicate ToolResultBlock construction - Route generateAndAddErrorToolResults() through PostActingEvent hook pipeline for consistent tool-result lifecycle (StreamingHook TOOL_RESULT emission, hook-based transforms) - Narrow onErrorResume catch scope to Exception.class, letting critical JVM errors (e.g. OutOfMemoryError) propagate - Use ExceptionUtils.getErrorMessage() for non-null error messages and log the exception object itself for full stack traces - Strengthen HookStopAgentTest auto-recovery assertions: verify error ToolResultBlock in memory, model re-invocation, and response content
LearningGp
left a comment
There was a problem hiding this comment.
The changes to executeToolCalls look solid, as they allow the model to proceed even after a tool call exception. However, I’m not sure about adding automatic recovery to doCall. In my view, it might be better to surface these exceptions to the framework consumers instead. Open to discussion on this point!
This fix not only ensures execution continues after tool call timeouts and exceptions, but also resolves a critical issue where the entire agent would become unusable following a timeout, repeatedly throwing IllegalStateException. I have been running this updated version in our production environment for some time now, and it has been performing flawlessly. |
It seems that the modifications to executeToolCalls should be sufficient to ensure a ToolResultBlock is present following a timeout or exception, which would prevent the persistent IllegalStateException. However, I have some reservations about the auto-supplement logic in doCall. Since HITL (Human-In-The-Loop) workflows require a ToolResultBlock to be manually provided when resuming a conversation, automating this process might lead to unintended execution paths. It could also potentially mask improper HITL usage, making it harder to detect implementation errors. It’s great to hear that this fix has been verified in production! A couple of follow-up questions:
|
Thanks for the thorough review! I appreciate the concern about HITL workflows — preserving the explicit contract for manual Let me address your questions: Regarding scenarios where
|
Thanks for the thorough review! I've carefully reconsidered this, and I believe this is fundamentally a bug fix rather than an enhancement — adding configuration parameters would not be the right approach. Why this is a critical bug (not just an enhancement)The current behavior causes complete conversation failure on any tool execution error:
This violates basic fault isolation principles — a single tool failure should not crash the entire agent permanently. Two key reasons why auto-supplement is necessaryFirst, tool execution failures should not leave the agent in a permanently broken state. Whether it's a timeout, network error, or unexpected exception, the agent must recover and continue functioning. The "auto-supplement" in Second, manually generating error results and feeding them to the LLM is essential for proper decision-making. Without this feedback, the model has no visibility into what happened. By providing explicit error messages (e.g., "[ERROR] Tool execution failed: timeout"), we enable the model to:
This is not masking errors — it's proper error propagation to the model layer. Regarding HITL concernsThe HITL workflow concern is valid, but I'd argue:
Summary
The combination ensures robust error handling while keeping the implementation clean and deterministic. I'm happy to discuss alternative HITL integration patterns if needed, but I believe the core fix should remain as-is. Would appreciate your thoughts on this perspective. |
Thanks for the feedback. I agree with your take on the fault isolation principle. One nuance to consider: the current autocompletion doesn't differentiate between the root causes of a missing ToolResult. This is precisely why I previously mentioned the HITL (Human-in-the-loop) mechanism. We need to decide where to propagate exceptions: to the model (via autocompletion) or to the system/developer. Model propagation works well for recoverable errors (timeouts, execution failures) where the model can retry or select alternative tool. But for non-recoverable issues (bugs, API misuse or other system-level failures), system-side propagation is likely more appropriate. Refining this would require mapping out specific scenarios, so for this iteration, propagating all cases to the model seems like a reasonable baseline. @chickenlj PTAL |
Thanks for the clear guidance on error categorization. I fully agree that differentiating recoverable vs non-recoverable errors is the right long-term direction. In the latest commit, I've made several improvements that partially address this:
For the error categorization refinement (routing certain exception types to system-side propagation), I think that fits well as a follow-up iteration once we have concrete scenarios mapped out. The current hook-based extensibility should cover most custom handling needs in the meantime. Could you take another look when you get a chance? |
Works for me. |
…Resume Avoid swallowing InterruptedException in the onErrorResume handler. In AgentScope, InterruptedException is the cooperative interruption signal used by the agent stop policy. Converting it into an error tool result would silently break the interruption mechanism. Now InterruptedException is re-thrown via Mono.error() so it propagates to AgentBase.createErrorHandler() which routes it to handleInterrupt() as intended.
…n (#951)
ReActAgent throws IllegalStateException when tool calls timeout or fail, because no tool result is written to memory, leaving orphaned pending tool call states that crash the agent on subsequent requests.
Root cause:
Changes:
This ensures the agent recovers gracefully from tool failures instead of crashing, and the model receives proper error feedback to continue processing.
Closes #951
AgentScope-Java Version
[The version of AgentScope-Java you are working on, e.g. 1.0.9, check your pom.xml dependency version or run
mvn dependency:tree | grep agentscope-parent:pom(only mac/linux)]Description
[Please describe the background, purpose, changes made, and how to test this PR]
Checklist
Please check the following items before code is ready to be reviewed.
mvn spotless:applymvn test)